This post walks through common metrics for evaluating model performance in R, including recall, precision, F1 score, and kappa statistics, using the tidyverse and pipes, with predictions and ground truth stored in two data frames with matching column names. It also highlights the importance of correctly specifying factor levels to ensure valid metric calculations.
Disclaimer: This post is written by an AI language model based on R code provided by the author. The purpose is to document and explain R techniques for personal reference.
Evaluating model performance is crucial for understanding how well your machine learning models are working. In this post, we’ll explore different metrics, including recall, precision, F1 score, and kappa statistics, which help assess the accuracy and reliability of your models. We’ll simplify the implementation using the tidyverse package and pipes, assuming you have a dataset named prediction for predicted values and correct for actual values, with matching variable names. We also emphasize the importance of correctly specifying factor levels when working with binary classification data: incorrect level ordering can lead to invalid or misleading metric calculations, which we will demonstrate and address.
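To make the definitions concrete, here is a minimal sketch on made-up toy vectors (pred and truth below are invented for illustration, not part of the original analysis), showing how caret derives these metrics from a confusion table:

library(caret)
# Toy data; "1" is the positive class because it is listed first in levels
pred  <- factor(c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1), levels = c(1, 0))
truth <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1), levels = c(1, 0))
table(pred, truth)      # confusion matrix: rows = predicted, columns = actual
recall(pred, truth)     # TP / (TP + FN) = 4/5 = 0.8
precision(pred, truth)  # TP / (TP + FP) = 4/6 ≈ 0.67
F_meas(pred, truth)     # harmonic mean of precision and recall ≈ 0.73

All three functions score the first factor level as the positive (“relevant”) class by default, which is why level order matters throughout this post.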
We’ll use the pacman package to load irr, caret, tidyverse, and gt for calculating metrics, building tables, and managing data efficiently. pacman::p_load() installs any missing packages before loading them.

pacman::p_load("irr", "caret", "tidyverse", "gt")

We’ll create example datasets prediction and correct to demonstrate the evaluation process. These datasets have matching variable names and contain binary classification data. Important note: the factor levels must be correctly specified, with 1 representing the positive class and 0 the negative class. If the levels are reversed (e.g., levels = 0:1), the metrics will be computed incorrectly.
set.seed(123)
# Create example datasets
prediction <- tibble(
  Formål_1 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_6 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0))
)
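As a quick check, the positive class 1 is indeed the first level (this output is deterministic, since it depends only on the factor definition):

levels(prediction$Formål_1)
[1] "1" "0"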
correct <- tibble(
  Formål_1 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Formål_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_2 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_3 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_4 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_5 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0)),
  Avsender_6 = factor(sample(c(0, 1), 100, replace = TRUE), levels = c(1, 0))
)

When factor levels are specified incorrectly (e.g., levels = 0:1), the positive and negative classes are swapped: caret’s metric functions treat the first factor level as the positive class, so the metrics are silently computed for the negative class. For instance:
incorrect_prediction <- factor(sample(c(0, 1), 100, replace = TRUE), levels = c(0, 1))
incorrect_reference <- factor(sample(c(0, 1), 100, replace = TRUE), levels = c(0, 1))
# Incorrect calculation
recall(data = incorrect_prediction, reference = incorrect_reference)
[1] 0.5961538
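Here recall has been computed for class 0, because 0 is the first level. As a sketch of the fix, releveling the same vectors so that 1 comes first scores the intended positive class instead (fixed_prediction and fixed_reference are illustrative names; output is omitted because it depends on the random draw above):

# Relevel so that "1" is treated as the positive class
fixed_prediction <- factor(incorrect_prediction, levels = c(1, 0))
fixed_reference <- factor(incorrect_reference, levels = c(1, 0))
recall(data = fixed_prediction, reference = fixed_reference)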
Next, we define a helper function that computes recall, F1, and precision for a single variable:

compute_metrics <- function(variable_name, prediction_data, correct_data) {
  tibble(
    Variabel = variable_name,
    rec = recall(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]]),
    F1 = F_meas(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]]),
    prec = precision(data = prediction_data[[variable_name]], reference = correct_data[[variable_name]])
  )
}
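Applied to a single column, the helper returns a one-row tibble (output omitted, since the values depend on the simulated data):

compute_metrics("Formål_1", prediction, correct)

To evaluate every variable at once, we collect the column names and map the helper over them: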
variables <- colnames(prediction)
results <- map_df(variables, ~ compute_metrics(.x, prediction, correct))

We define a similar helper for Cohen’s kappa and percentage agreement, using kappa2() and agree() from the irr package:

compute_kappa_agreement <- function(variable_name, prediction_data, correct_data) {
  tibble(
    Variabel = variable_name,
    Kappa = kappa2(cbind(prediction_data[[variable_name]], correct_data[[variable_name]]))$value,
    Agreement = agree(cbind(prediction_data[[variable_name]], correct_data[[variable_name]]))$value
  )
}
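As a sanity check on what these two functions measure, here is a toy example (hypothetical ratings, unrelated to the data above) in which two raters agree perfectly:

ratings <- cbind(c(1, 0, 1, 1, 0), c(1, 0, 1, 1, 0))
kappa2(ratings)$value  # 1: perfect chance-corrected agreement
agree(ratings)$value   # 100: agree() reports percentage agreement (0–100)

Cohen’s kappa corrects raw agreement for agreement expected by chance, so on random binary data it hovers near zero even when raw agreement is close to 50%. We now map this helper over all variables: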
kappa_agreement_results <- map_df(variables, ~ compute_kappa_agreement(.x, prediction, correct))

Finally, we join the two sets of results into a single table:

final_results <- results %>%
  left_join(kappa_agreement_results, by = "Variabel")
gt(final_results) %>% fmt_number(2:5)

| Variabel | rec | F1 | prec | Kappa | Agreement | 
|---|---|---|---|---|---|
| Formål_1 | 0.40 | 0.43 | 0.47 | −0.06 | 47 | 
| Formål_2 | 0.67 | 0.64 | 0.61 | 0.26 | 63 | 
| Formål_3 | 0.47 | 0.47 | 0.47 | −0.04 | 48 | 
| Formål_4 | 0.45 | 0.46 | 0.47 | 0.04 | 53 | 
| Formål_5 | 0.53 | 0.52 | 0.50 | 0.06 | 53 | 
| Avsender_2 | 0.56 | 0.54 | 0.53 | −0.03 | 49 | 
| Avsender_3 | 0.53 | 0.49 | 0.45 | 0.01 | 50 | 
| Avsender_4 | 0.49 | 0.50 | 0.51 | 0.00 | 50 | 
| Avsender_5 | 0.41 | 0.43 | 0.45 | −0.06 | 47 | 
| Avsender_6 | 0.49 | 0.46 | 0.44 | −0.06 | 47 | 

Note that agree() reports simple percentage agreement (0–100), while kappa2() returns Cohen’s kappa, which corrects for chance; with random data like this, agreement hovers around 50% and kappa around zero.
For attribution, please cite this work as
Solheim, Ø. B. & ChatGPT (Ghost Writer) (2025, April 1). Solheim: Evaluating Model Performance: Recall, Precision, F1 Score, and Kappa using Tidyverse. Retrieved from https://www.oyvindsolheim.com/library/Evaluating model performance/
BibTeX citation
@misc{solheim2025evaluating,
  author = {Solheim, Øyvind Bugge and {ChatGPT (Ghost Writer)}},
  title = {Solheim: Evaluating Model Performance: Recall, Precision, F1 Score, and Kappa using Tidyverse},
  url = {https://www.oyvindsolheim.com/library/Evaluating model performance/},
  year = {2025}
}